By Koya Saito
CMSE 201 Section 007, Dr. Finzell
The mission of this project is to make tangible the overwhelming amount of crime in Chicago. While we have enough information to calculate statistics and mark trends, how do we wrap our heads around 2.5 million separate occurrences throughout the city? The aim of this project is to visualize crime in a way that adds information to the raw CSVs the data arrives in: added in the sense that the maps make it intuitive to understand the geography and to draw connections to other forms of geospatial data. In this project, crime data from the years 2008 through 2011 is mapped and displayed.
Also addressed in this project is the ethical responsibility of a data scientist when interpreting data: to identify biases and to prevent misinformed assumptions from being drawn from analysis. While this project displays crime data throughout the entire city of Chicago, it falls short in the breadth of its analysis, considering only one variable, median household income, as an influencing factor in crime. As a follow-up to the mapping in this report, a brief section on the dangers of misusing statistics is included.
There are three sections to this project, each containing a different map.
Using the mouse, hover over each map to see more information. Scroll and drag to move the map around and focus on different areas.
# Data processing
import numpy as np
import pandas as pd
import json
from shapely.geometry import shape, Point
# Standard plotting
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Map plotting
import folium
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Import data
chicago_crimes = pd.read_csv('data/Crime/Chicago_Crimes_2008_to_2011.csv', error_bad_lines=False) # error_bad_lines=False skips malformed rows, so our dataset may not be a complete representation
income_data = pd.read_csv('data/Income/Total Income by Location.csv')
with open('data/Geo Boundaries/Boundaries - Census Tracts - 2010.geojson') as JsonBounds:
    census_geo_JSON = json.load(JsonBounds)
income_data['Formatted ID Geography'] = income_data['ID Geography'].str.slice(7) # Converting ID Geography column to same format as JSON data
income_data = income_data[income_data['Year'] == 2013].reset_index()
income_data.head()
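The `str.slice(7)` call above strips a fixed-width prefix so the income table's IDs match the bare GEOIDs in the boundary GeoJSON. A minimal sketch of that conversion, assuming the DataUSA-style `14000US` prefix (the exact prefix is an assumption; the notebook only shows the slice):

```python
import pandas as pd

# Hypothetical sample IDs: a 7-character "14000US" prefix ahead of the
# 11-digit GEOID used by the census-tract GeoJSON ("properties.geoid10").
ids = pd.Series(["14000US17031010100", "14000US17031010201"])

# Dropping the first 7 characters leaves the bare GEOID.
geoids = ids.str.slice(7)
print(geoids.tolist())  # ['17031010100', '17031010201']
```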
This Plotly tutorial was a major reference: https://plotly.com/python/mapbox-county-choropleth/
# Getting range of data
minIncome = min(income_data['Household Income by Race'])
maxIncome = max(income_data['Household Income by Race'])
# Formats a value as currency, e.g. format_currency(56789) -> "$56.8K"
# (renamed from format to avoid shadowing the built-in)
# https://stackoverflow.com/questions/35019156/pandas-format-column-as-currency
def format_currency(x):
    return "${:.1f}K".format(x / 1000)
# Defining a column in the DataFrame for mouse hovering
income_data['text'] = 'Median household income: ' + income_data['Household Income by Race'].apply(format_currency).astype(str) + '<br>' + income_data['Geography']
fig = go.Figure(go.Choroplethmapbox(geojson=census_geo_JSON,
featureidkey="properties.geoid10",
locations=income_data["Formatted ID Geography"],
z=income_data['Household Income by Race'],
text=income_data['text'],
hoverinfo='text',
colorscale="Blues",
marker_line_width=1,
marker_line_color='white',
marker_opacity=0.5,
zmin=minIncome,
zmax=maxIncome))
# fig.update_layout(mapbox_style="carto-positron", mapbox_zoom=9, mapbox_center = {"lat": 41.864073, "lon": -87.706819})
fig.update_layout(mapbox_style="light",
mapbox_accesstoken='pk.eyJ1Ijoia295YXMiLCJhIjoiY2toenR4dGd6MHRpczMzbzJmYWVwcnBtNyJ9.JTZCO0J5FbFDj8OkCDIs5w',
mapbox_zoom=9,
mapbox_center = {"lat": 41.864073, "lon": -87.706819},
)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
# Income data choropleth trace object
trace1 = go.Choroplethmapbox(geojson=census_geo_JSON,
featureidkey="properties.geoid10",
locations=income_data["Formatted ID Geography"],
z=income_data['Household Income by Race'],
text=income_data['text'],
hoverinfo='text',
colorscale="Blues",
marker_line_width=1,
marker_line_color='white',
marker_opacity=0.5,
zmin=minIncome,
zmax=maxIncome)
Note that we can adjust how many crimes are shown on this map. Further analysis of this data could include splitting the crime data in different ways, including by type of crime, time committed, geography, and more.
chicago_crime = chicago_crimes[:10000] # Subsample so the map stays responsive
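One of the splits mentioned above, filtering by offense category, can be sketched as follows. The `Primary Type` column is from the source data; the toy rows and category values here are illustrative stand-ins:

```python
import pandas as pd

# Toy stand-in for chicago_crime; the real frame has the same columns.
chicago_crime = pd.DataFrame({
    'Primary Type': ['THEFT', 'BATTERY', 'THEFT', 'NARCOTICS'],
    'Latitude': [41.88, 41.75, 41.90, 41.80],
    'Longitude': [-87.63, -87.60, -87.65, -87.62],
})

# Keep only a single offense category before plotting.
thefts = chicago_crime[chicago_crime['Primary Type'] == 'THEFT']
print(len(thefts))  # 2
```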
# Crime scattermapbox trace object
trace2 = go.Scattermapbox(
lat=chicago_crime['Latitude'],
lon=chicago_crime['Longitude'],
mode='markers',
marker=go.scattermapbox.Marker(
size=3,
color='rgb(242, 177, 172)',
opacity=0.7
),
hoverinfo='text',
text=chicago_crime['Primary Type']
)
fig = make_subplots()
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.update_layout(mapbox_style="light",
mapbox_accesstoken='pk.eyJ1Ijoia295YXMiLCJhIjoiY2toenR4dGd6MHRpczMzbzJmYWVwcnBtNyJ9.JTZCO0J5FbFDj8OkCDIs5w',
mapbox_zoom=8,
mapbox_center = {"lat": 41.864073, "lon": -87.706819},
)
fig.show()
# Converting dictionary, previously used for Choropleth to DataFrame
census_geo_df = pd.DataFrame.from_dict(census_geo_JSON)
census_geo_df.head()
print(census_geo_df['features'][0])
# How many census tracts are we dealing with?
print(len(census_geo_df.index)) # According to our boundary data
print(len(income_data['Geography'].unique())) # According to our median income data
# Assuming this checks out with external references (most say 866; we're using 2010 data)
# What do our data points look like?
print(chicago_crimes['Location'].head())
print('')
print(chicago_crimes['Longitude'].head())
print(chicago_crimes['Latitude'].head())
'''
Creates columns (Census Tract Name and Census Tract geoID) for the crime data,
indicating which census tract each crime happened in. NaN values are added for
crimes where lon/lat data isn't available or falls outside all of the census
tracts we're looking at.
References for checking whether a point is in a polygon:
https://stackoverflow.com/questions/20776205/point-in-polygon-with-geojson-in-python
http://archived.mhermans.net/geojson-shapely-geocoding.html
https://pypi.org/project/Shapely/
'''
def locateCrimesPerTract(crimes, tracts):
    census_tract_column = []
    census_geoID_column = []
    count = 0
    for index, row in crimes.iterrows():
        # Progress bar (multiply by 100 so the figure is actually a percentage)
        if count % 100 == 0:
            print('Num crimes located: {:8d} | Percent done: {:5.2f}'.format(count, 100 * count / len(crimes.index)), end='\r')
        count += 1
        # Normally commented out, used for debugging. Prematurely stops the loop at a given number
        # if count >= 1555:
        #     print('here')
        #     break
        lon = row['Longitude']
        lat = row['Latitude']
        if np.isnan(lon) or np.isnan(lat):
            # No coordinates available: record NaN for this crime
            census_tract_column.append(np.nan)
            census_geoID_column.append(np.nan)
            continue
        point = Point(lon, lat)
        tractName = np.nan
        geoID = np.nan
        for feature in tracts['features']:
            polygon = shape(feature['geometry'])
            if polygon.contains(point):
                tractName = feature['properties']['namelsad10']
                geoID = feature['properties']['geoid10']
                break
        census_tract_column.append(tractName)
        census_geoID_column.append(geoID)
    return (census_tract_column, census_geoID_column)
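The point-in-polygon check at the heart of this function can be sketched with a toy GeoJSON-style feature (a unit square standing in for a census tract; the `geoid10` and `namelsad10` values here are placeholders):

```python
from shapely.geometry import shape, Point

# Minimal GeoJSON-style feature: a unit square standing in for a tract.
feature = {
    'type': 'Feature',
    'geometry': {
        'type': 'Polygon',
        'coordinates': [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
    },
    'properties': {'geoid10': '00000000000', 'namelsad10': 'Census Tract 0'},
}

polygon = shape(feature['geometry'])
print(polygon.contains(Point(0.5, 0.5)))  # True  (inside the square)
print(polygon.contains(Point(2.0, 2.0)))  # False (outside)
```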
For each year we run the counting function, checking which tract each crime happened in. Running these takes a long time; I had to let them run overnight.
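The per-year frames (`crimes2008` through `crimes2011`) are not constructed in the code shown; they were presumably split from `chicago_crimes`. A minimal sketch of one way to do that, assuming the CSV has a `Year` column (an assumption; the toy rows are illustrative):

```python
import pandas as pd

# Hypothetical mini-frame; the real chicago_crimes CSV is assumed to
# include a 'Year' column (the actual split code is not shown above).
chicago_crimes = pd.DataFrame({
    'Year': [2008, 2009, 2010, 2011, 2011],
    'Primary Type': ['THEFT', 'BATTERY', 'THEFT', 'ASSAULT', 'THEFT'],
})

# One way the per-year frames could have been produced:
crimes2008 = chicago_crimes[chicago_crimes['Year'] == 2008]
crimes2011 = chicago_crimes[chicago_crimes['Year'] == 2011]
print(len(crimes2008), len(crimes2011))  # 1 2
```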
census_tract_column, census_geoID_column = locateCrimesPerTract(crimes2011, census_geo_df)
print(len(census_tract_column))
print(len(crimes2011.index))
crimes2011 = crimes2011.copy()
crimes2011['GeoID'] = census_geoID_column
crimes2011['Tract'] = census_tract_column
crimes2011.to_csv('data/Crime/Processed Data/crimes2011.csv', index=False)
tract_2008, geoID_2008 = locateCrimesPerTract(crimes2008, census_geo_df)
crimes2008 = crimes2008.copy()
crimes2008['GeoID'] = geoID_2008
crimes2008['Tract'] = tract_2008
crimes2008.to_csv('data/Crime/Processed Data/crimes2008.csv', index=False)
tract_2009, geoID_2009 = locateCrimesPerTract(crimes2009, census_geo_df)
crimes2009 = crimes2009.copy()
crimes2009['GeoID'] = geoID_2009
crimes2009['Tract'] = tract_2009
crimes2009.to_csv('data/Crime/Processed Data/crimes2009.csv', index=False)
tract_2010, geoID_2010 = locateCrimesPerTract(crimes2010, census_geo_df)
crimes2010 = crimes2010.copy()
crimes2010['GeoID'] = geoID_2010
crimes2010['Tract'] = tract_2010
crimes2010.to_csv('data/Crime/Processed Data/crimes2010.csv', index=False)
Combining all the data back into the same DataFrame
processed_crimes2008 = pd.read_csv('data/Crime/Processed Data/crimes2008.csv')
processed_crimes2009 = pd.read_csv('data/Crime/Processed Data/crimes2009.csv')
processed_crimes2010 = pd.read_csv('data/Crime/Processed Data/crimes2010.csv')
processed_crimes2011 = pd.read_csv('data/Crime/Processed Data/crimes2011.csv')
frames = [processed_crimes2008, processed_crimes2009, processed_crimes2010, processed_crimes2011]
processed_crimes = pd.concat(frames)
processed_crimes.head()
A function to filter through our crime data and count how many times each census tract had an occurrence. This function also formats that information and returns it cleanly.
def countCrimePerTract(processed_crimes):
    # Group by both identifiers so GeoIDs and tract names stay paired;
    # taking two separate value_counts() risks misalignment on tied counts.
    crimePerTract = (processed_crimes.groupby(['GeoID', 'Tract'])
                     .size()
                     .reset_index(name='Num Crimes'))
    crimePerTract['GeoID'] = crimePerTract['GeoID'].astype(int).astype(str)
    return crimePerTract[['GeoID', 'Tract', 'Num Crimes']]
crimePerTract = countCrimePerTract(processed_crimes)
crimePerTract.head()
minCrime = min(crimePerTract['Num Crimes'])
maxCrime = max(crimePerTract['Num Crimes'])
crimeChoropleth = go.Figure(go.Choroplethmapbox(geojson=census_geo_JSON,
featureidkey="properties.geoid10",
locations=crimePerTract['GeoID'],
z=crimePerTract['Num Crimes'],
text=crimePerTract['Num Crimes'],
hoverinfo='text',
colorscale="Reds",
marker_line_width=1,
marker_line_color='white',
marker_opacity=0.5,
zmin=minCrime,
zmax=maxCrime))
crimeChoropleth.update_layout(mapbox_style="light",
mapbox_accesstoken='pk.eyJ1Ijoia295YXMiLCJhIjoiY2toenR4dGd6MHRpczMzbzJmYWVwcnBtNyJ9.JTZCO0J5FbFDj8OkCDIs5w',
mapbox_zoom=8.75,
mapbox_center = {"lat": 41.864073, "lon": -87.706819},
)
crimeChoropleth.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
crimeChoropleth.show()
The map above shows the number of crimes committed over the course of four years within each census tract.
with open('data/Geo Boundaries/Boundaries - Community Areas (current).geojson') as JsonBounds:
    community_geo_JSON = json.load(JsonBounds)
socioeconomic_indicators = pd.read_csv('data/Income/Census_Data_-_Selected_socioeconomic_indicators_in_Chicago__2008___2012.csv')
socioeconomic_indicators['COMMUNITY AREA NAME'] = socioeconomic_indicators['COMMUNITY AREA NAME'].str.upper()
socioeconomic_indicators
indicatorChoropleth = go.Figure(go.Choroplethmapbox(geojson=community_geo_JSON,
featureidkey="properties.community",
locations=socioeconomic_indicators['COMMUNITY AREA NAME'],
z=socioeconomic_indicators['HARDSHIP INDEX'],
text=socioeconomic_indicators['HARDSHIP INDEX'],
hoverinfo='text',
colorscale="Reds",
marker_line_width=1,
marker_line_color='black',
marker_opacity=0.5,
zmin=min(socioeconomic_indicators['HARDSHIP INDEX']),
zmax=max(socioeconomic_indicators['HARDSHIP INDEX'])))
indicatorChoropleth.update_layout(mapbox_style="light",
mapbox_accesstoken='pk.eyJ1Ijoia295YXMiLCJhIjoiY2toenR4dGd6MHRpczMzbzJmYWVwcnBtNyJ9.JTZCO0J5FbFDj8OkCDIs5w',
mapbox_zoom=8.75,
mapbox_center = {"lat": 41.864073, "lon": -87.706819},
)
indicatorChoropleth.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
indicatorChoropleth.show()
This project is not in a position to make any definitive claims about the nature of crime and its connection to the discussed variables. However, we can review the visualizations we made.
Each of the maps displays one aspect of the problem. Because we based the color gradients on the minimum and maximum values in the data, each map is a normalized view of its respective variable, helping us understand each one on a local scale relative to the rest of the city. Another strength of these maps is their potential to be expanded: the City of Chicago data portal has an incredible amount of diverse data to be analyzed.
This section of the project is a short introduction to the moral ambiguity of data analysis. Because it is an analyst's job to try to understand a dynamic system, one function of the job is to make assumptions and generalizations in order to understand the assumed underlying patterns in data. In general this is a sound strategy, and it has led to the rapid success of the field, but it carries moral implications when those assumptions are used to evaluate and make decisions about the wellbeing of human beings. When doing analysis of people for people, it is important not only to acknowledge the complexity of human systems but to be certain that all potential influencing factors are accounted for.
The following resource is a great example of how the immoral use of data analysis has been harmful in the past. This video is great for two reasons. 1) It gives examples of data analysis being misused both intentionally and unintentionally. 2) It acknowledges the human ability to take facts out of context, despite the common misconception that numbers and logic provide unshakable evidence.
from IPython.display import YouTubeVideo
YouTubeVideo("bVG2OQp6jEQ",width=640,height=360)
# https://youtu.be/bVG2OQp6jEQ
The following section is a snippet of code that could have been part of this project were it not for the precautions of responsible data analysis. It is dangerous for several reasons, and the primary one is the most subtle: displaying data this way leads an uninformed viewer to a conclusion they believe they reached on their own. The data is laid out in a way that, despite being "technically true," fails to capture the full context of the problem and vastly oversimplifies the issue it represents.
# Creating a DataFrame with crime and income together
tractInfo = crimePerTract.copy() # copy() so we don't mutate crimePerTract
median_incomes = []
for index, row in tractInfo.iterrows():
    geoID = row['GeoID']
    # median_incomes.append(income_data['Household Income by Race'][income_data['Formatted ID Geography'] == geoID])
    ind = income_data.index[income_data['Formatted ID Geography'] == geoID].tolist()
    if len(ind) > 0:
        median_incomes.append(income_data['Household Income by Race'][ind[0]])
    else:
        median_incomes.append(np.nan)
tractInfo['Median Household Income'] = median_incomes
tractInfo.head()
tractInfo.plot(kind='scatter', x='Num Crimes', y='Median Household Income')
Finally, a famous example of unintentional bias in data science: Survivorship bias.
Quoted from Wikipedia, "Survivorship bias or survival bias is the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility." This is a type of bias that arises from ignoring certain variables because no data points representing them could be collected. By analyzing a dataset that lacks a representative sample, one ignores not just the anomaly but, fundamentally, an entire group.
The example scenario takes place in a war. A general sends a fleet of planes on a mission to bomb a target. The fleet successfully completes the bombing, losing several planes but with a majority returning home nonetheless. Upon their return, the general has the army's data scientists analyze the returning planes to help the engineers upgrade them. These upgrades consist of reinforcing certain parts of each plane, armoring them to be more resistant to being shot, with the tradeoff of making the aircraft heavier. The data scientists mark all the locations on the planes where bullet holes are found and return with the above diagram. They infer that, since bullet holes are clearly clustered around the locations marked with red dots, the engineers should armor these places. This is where the bias arises.
The data scientists' mistake is assuming the places with bullet holes need armor. The planes that are shot in those places crash, and are therefore not part of the dataset of returning planes. This deduction, and the bias therein, is famously called survivorship bias and is relevant to the topics discussed in this project.
For more specific examples of bias, view this Towards Data Science article on different types of bias.